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1  Introduction 


In  this  document  we  describe  how,  with  the  use  of  Glenda,  we  were  able  to  parallelize  the 
ocean  modeling  program  SWEM.  After  a  description  of  Glenda  and  an  introduction  to  the 
ocean  model,  we  discuss  how  we  parallelized  the  model.  We  first  had  to  determine  the 
array  dependencies  in  the  program  in  order  to  better  understand  the  way  we  were  going 
to  handle  data  propagation.  Once  we  had  a  grasp  of  how  the  data  was  to  be  propagated, 
there  was  a  need  to  develop  overlap  definitions  to  help  improve  the  readability  of  the  code. 
But,  of  course,  we  still  encountered  problems  that  had  to  be  solved.  Therefore,  we  had 
to  use  a  wide  assortment  of  debugging  techniques  to  overcome  these  problems.  With 
our  problems  finally  solved  and  the  program  now  complete,  we  tested  our  new  version  of 
SWEM  and  gathered  results. 


2  Glenda 

Glenda  was  developed  in  1993  by  Benjamin  R.  Seyfarth  and  graduate  students  (myself 
included)  from  The  University  of  Southern  Mississippi  [11].  It  was  developed  to  provide 
both  the  basic  capabilities  of  Linda^  (a  group  of  functions  which  simplify  writing  parallel 
programs)  and  the  portability  of  PVM  (Parallel  Virtual  Machine).  The  Glenda  model  was 
set  up  to  closely  parallel  that  of  Linda  -  with  a  few  exceptions.  Glenda  is  built  utilizing  the 
PVM  software  system  to  provide  the  underlying  communications  [2].  PVM  is  a  collection  of 
functions  that,  like  Linda,  allow  the  user  to  make  use  of  a  multiprocessor  system.  Glenda 
supports  most  of  the  Linda  operations  with  a  few  added  capabilities  to  utilize  PVM  more 
fully  [12]. 

2.1  PVM  -  Parallel  Virtual  Machine 

PVM  (Parallel  Virtual  Machine)  is  a  software  system  that  enables  a  collection  of  het¬ 
erogeneous  computers  to  be  used  as  a  coherent  and  flexible  concurrent  computational 


^  Linda  is  a  registered  trademark  of  Scientific  Computing  Associates. 


resource.  The  individual  computers  may  be  shared-  or  local-memory  multiprocessors, 
vector  supercomputers,  specialized  graphics  engines,  or  scalar  workstations,  that  may 
be  interconnected  by  a  variety  of  networks,  such  as  ethernet,  FDDI,  etc.  PVM  support 
software  executes  on  each  machine  in  a  user-configurable  pool  and  presents  a  unified, 
general,  and  powerful  computational  environment  of  concurrent  applications.  User  pro¬ 
grams  written  in  C  or  Fortran  are  provided  access  to  PVM  through  the  use  of  calls  to 
PVM  library  routines  for  functions  such  as  process  initiation,  message  transmission  and 
reception,  and  synchronization  via  barriers  or  rendezvous.  Users  may  optionally  control 
the  execution  location  of  specific  application  components.  The  PVM  system  transpar¬ 
ently  handles  message  routing,  data  conversion  for  incompatible  architectures,  and  other 
tasks  that  are  necessary  for  operation  in  a  heterogeneous,  network  environment  [6]. 

2.2  Linda  and  Its  Use  of  Tuple  Space 

Linda  programs  communicate  by  inserting  and  retrieving  tuples,  or  collections  of  data 
items  referenced  by  a  name,  into  a  shared  memory  area  referred  to  as  “tuple  space”  [1] 
Therefore,  before  any  further  discussion  of  Glenda  can  proceed,  the  idea  behind  “tuple 
space”  should  be  described.  ‘Tuple  space”  is  memory  in  one  or  more  computers  whose 
purpose  is  to  serve  as  a  temporary  storage  facility  for  data  being  transferred  between 
processes.  The  data  being  transferred  is  grouped  into  collections  called  “tuples”.  Each 
tuple  consists  of  a  string,  which  serves  as  the  tuple’s  identifier,  and  zero  or  more  data 
items.  The  data  can  be  scalar  variables,  array  variables,  or  real  numbers : 

•  (“Data”,  i,  sum,  A :  10,  5) 

Here,  the  tuple’s  name  is  “Data”.  It  consists  of  four  data  items.  The  variables  i  and 
sum  are  scalars.  The  variable  A  is  an  array  of  size  10.  The  final  data  item  is  the 
integer  5.  When  a  process  is  ready  to  send  a  tuple  to  another  process,  the  process  to 
which  it  will  send  the  tuple  may  not  necessarily  be  ready  to  receive  it.  Thus,  to  prevent 
the  sending  process  from  having  to  wait,  the  tuple  is  temporarily  stored  in  ‘tuple  space” 
until  the  receiving  process  is  ready  to  receive  it  I3][4]. 


2.3  Glenda  Operations 


Glenda  is  made  up  of  five  of  the  six  Linda  operations,  as  well  as  five  other  functions  unique 
to  Glenda[11].  Here  are  the  Glenda  functions. 

•  tid  =  gLmytidQ 

•  tid  =  gLspawn  ( name,  [hostname] ) 

We  join  Glenda  by  calling  gLmytid  and  use  gl_spawn  to  start  subprocesses.  gLmytid 
returns  a  PVMtask  id  number  and  enrolls  it  into  PVM.  gLspawn  returns  a  PVM  task 
id  number  for  the  spawned  process. 

•  gl.out  { name, ... ) 

•  gIJn  ( name, ... ) 

•  gIJnp  { name, ... ) 

•  gLrd  ( name, ... ) 

•  gLrdp  ( name, ... ) 

These  are  the  five  primary  Glenda  functions.  Every  tuple  has  a  character  string  for 
its  first  component,  followed  by  zero  or  more  data  items  (represented  above  by ...). 
gl.out  outputs  a  tuple  into  tuple  space  for  other  processes  to  retrieve  using  glJn  or 
gLrd.  gLinp  and  gLrdp  are  predicate  versions  of  glJn  and  gij'd  and  only  retrieve  a 
tuple  if  a  tuple  is  available. 

•  gl.outto  ( tid,  name, ... ) 

•  glJnto  ( name, ... ) 

The  functions  outto  and  into  were  added  to  make  use  of  PVM’s  multicast  capability. 
gLoutto  can  be  used  along  with  a  PVM  task  identifier  to  send  a  tuple  directly  to  a 
task.  gLinto  must  then  be  used  to  retrieve  a  tuple  sent  using  gLoutto. 


•  gi.exito 

To  exit  out  of  PVM  and  the  Glenda  tuple  server,  the  function  gLexit  must  be  used 
[12]. 


3  The  Shallow  Water  Equation  Model  (SWEM) 

SWEM  is  an  acronym  for  Shallow  Water  Equation  Model.  It  was  developed  by  Katherine 
S.  Hedstrom  of  Rutgers  University.  The  SWEM  code  is  derived  from  the  external  mode 
equations  for  the  solution  of  the  vertically-integrated  flow  which  are  part  of  the  3-D  free 
surface  models  currently  being  developed  by  Prof.  D.  Haidvogel  and  his  colleagues  at 
Rutgers  University.  The  principal  attributes  of  SWEM  are 

•  finite  differencing 

•  Arakawa  C-grid 

•  generalized  boundary-fitted  orthogonal  coordinates 

•  option  for  masking  out  land  areas 

The  generalized  orthogonal  coordinates  were  introduced  to  avoid  the  numerical  inaccu¬ 
racies  of  approximating  the  coastlines  by  a  step-like  function.  In  basin-wide  applications, 
the  complexity  of  the  domain  geometry  makes  it  virtually  impossible  to  adopt  boundary 
fitted  grids.  However,  curvilinear  coordinates  give  the  opportunity  to  concentrate  fine  spa¬ 
tial  resolution  in  areas  of  higher  interest  and  minimize  the  number  of  masked  land  points, 
indeed,  reducing  the  computational  cost  of  using  Cartesian  coordinates  over  the  whole 
domain  at  the  required  fine  resolution[10]. 

In  general,  ocean  models  describe  the  response  of  a  variable  density  ocean  to  atmo¬ 
spheric  momentum  and  heat  forcing.  This  response  can  very  simply  be  represented  in 
terms  of  eigenmodes  of  a  linearized  system  of  equations.  The  zeroth  mode  is  equivalent 
to  the  vertically-averaged  component  of  the  motion,  known  as  the  harotropic  mode. 
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The  higher  modes  are  called  haroclinic  modes  and  are  associated  with  the  higher 
order  components  of  the  vertical  density  profile  [5]. 

Ocean  models  usually  make  the  hydrostatic  approximation  in  which  the  pressure  dif¬ 
ference  between  two  points  on  the  same  vertical  line  depends  only  on  the  weight  as  if  the 
fluid  were  at  rest.  Linder  these  assumptions,  at  any  point  the  pressure  forces  depend  on 
the  thickness  of  the  water  column  above  that  point,  as  well  as  on  the  vertical  variations  of 
the  water  density.  Barotropic  models  neglect  the  3-D  structure  of  the  density  distribution 
and  assume  that  the  ocean  is  a  homogeneous  fluid.  So,  this  relation  hoids  if  the  horizon¬ 
tal  dimensions  of  the  ocean  volume  under  consideration  are  much  larger  than  the  vertical 
dimensions,  hence  the  shallow  wafer  designation  [9]. 

3.1  Barotropic  and  Reduced  Gravity  Ocean  Models 

Barotropic  models  are  interesting  and  important  for  severai  reasons.  First  of  all,  the  free 
surface  elevation  couples  directly  to  the  barotropic  mode.  One  of  the  important  data  fieids 
used  as  part  of  the  initialization  and  updating  procedures  for  reai-time  ocean  prediction 
are  satellite  altimeter  measurements^  of  the  free  surface  elevation.  Thus,  information 
from  altimeters  may  first  enter  the  ocean  modei  through  the  barotropic  mode. 

The  presence  of  free  surface  gravity  waves  represents  a  second  important  feature 
of  barotropic  models.  Gravity  waves  are  fast  surface  waves  that  propagate  at  a  speed 
c  =  where  g  is  the  gravitational  acceleration,  and  H  is  the  depth  of  the  ocean 

[9].  The  simple  explicit  finite-difference  schemes  treating  such  waves  are  subject  to  se¬ 
vere  time  step  iimitations  so  that  solution  of  the  barotropic  mode  may  lead  to  large  CPU 
requirements.  Therefore,  it  is  important  to  study  the  solution  of  this  system  with  efficient 
numerical  schemes,  before  incorporating  it  into  the  general  3-D  ocean  models. 

The  pressure  gradients  associated  with  the  free  surface  elevation  are  constant  with 
depth.  Thus,  they  form  part  of  the  zeroth  mode  or  the  vertically-averaged  mode,  and 

radar  altimeter  measures  the  distance  from  the  satellite  to  the  surface  of  the  ocean, 
and  if  the  position  of  the  satellite  in  space  is  known,  this  measurement  allows  the  ocean 
surface  deviations  from  the  level  corresponding  to  no  motion  to  be  inferred  on  time  scales 
of  a  few  days  to  years. 


appear  only  in  the  barotropic  mode  equations.  Consequently,  the  baroclinic  system  rep¬ 
resenting  the  higher-order  modes  has  no  surface  elevation  associated  with  it,  and  the 
corresponding  surface  boundary  condition  is  that  of  a  rigid  lid. 

A  particular  form  of  the  baroclinic  models  are  the  so-called  reduced  gravity  mod¬ 
els  [5].  These  traditionally  assume  a  dynamically  active  upper  layer  of  density  pi  which 
overlays  a  motionless,  infinitely  deep  layer  of  density  p2-  The  corresponding  mathe¬ 
matical  equations  are  formally  equivalent  to  the  barotropic  models  with  the  gravitational 
acceleration  g  substituted  by  g'  =  and  the  ocean  depth,  H,  by  the  thickness 

of  the  upper  layer,  h.  The  reduced  gravity  acceleration,  c/,  has  a  typical  value  of  2/3 
cms”^.  Similarly,  the  fastest  waves  contained  in  the  reduced  gravity  models  travels  at  a 
speed  c  =  y/g'h,  or  about  ten  to  one  hundred  times  slower  than  the  barotropic  gravity 
waves[9]. 

So,  as  we  form  the  model  equations  associated  with  SWEM,  we  will  refer  to  barotropic 
model  equations  with  the  knowledge  that  these  simple  substitutions  will  result  in  the  re¬ 
duced  gravity  model  formulations  used  in  our  simulation. 

3.2  Model  Equations  for  SWEM 

The  model  equations  for  the  barotropic  component  of  a  hydrostatic  ocean  are  derived 
from  the  Navier-Stokes  equations  for  incompressible  flow  on  a  rotating  Earth.  Here,  they 
are  presented  in  Cartesian  coordinates  with  constant  friction  coefficients  to  simplify  the 
arithmetical  text. 

^  .  .fV-gH 

(2) 

... 

dt  \dx  *  dy  )  ^ 

where 
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U^(U,V) 


-mass  transports  in  the  x-  and  ^/-directions,  respectively 
-free  surface  elevation 
Tw  =  [('r«;)£,  {Tw)y]  -wind  stress  components 
-bottom  stress  components 
-Corioiis  parameter  for  latitude  (j)',  Q  is  the 


n  =  m)x,  in)y] 
f  =  iQ.sincj) 


angular  rotation  rate  of  the  Earth  7X10  ^s  ^ 


H(x,y)  -topography  (bottom  depth) 

g  -acceleration  due  to  gravity 

-Laplacian  operator  in  horizontal  coordinates  x,y 
A  -coefficient  of  lateral  friction 

Now  we  must  choose  our  boundary  conditions.  The  main  criterion  is  whether  a  bound¬ 
ary  is  closed  (like  at  a  shore)  or  open  (like  at  a  strait  where  waves  and  currents  can  en¬ 
ter  and  exit  the  model  region).  At  rigid  walls  (a  coast,  the  ocean  bottom,  etc.)  no  flux 
of  momentum  is  prescribed.  That  is,  the  cross  boundary  velocity  is  set  to  zero.  An¬ 
other  condition  that  is  often  applied  is  setting  the  tangential  velocity  to  zero  at  the  lateral 
coastal  boundary.  The  vanishing  of  the  tangential  velocity  implies  the  existence  of  fric¬ 
tional  boundary  layers,  because  the  velocity  is  brought  to  zero  from  the  free  stream  value 
across  a  thin  boundary  layer;  and  in  this  layer,  friction  is  important.  At  the  ocean  bottom, 
the  effects  of  friction  are  represented  by  the  bottom  stress,  T j,.  SWEM  applies  a  linear 
drag  coefficient,  such  that  =  CdU  and  =  CdV.  At  the  sea  surface,  the  wind 


stress  represents  the  input  of  energy  from  the  atmosphere  [5]. 

Of  course,  having  a  closed  boundary  is  not  our  only  option.  However,  boundary  con¬ 
ditions  at  open  boundaries  (such  as  a  strait  or  a  gulf)  are  more  difficult  to  assign.  In 
general,  it  is  necessary  to  specify,  either  from  observations  and/or  estimates,  features 
that  enter  the  domain  and  allow  features  to  exit  without  generating  disturbances  inside 
the  modeled  region  [9]. 
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3.3  Numerical  Schemes  for  SWEM 


The  system  of  equations  (1)-(3)  has  both  parabolic  and  hyperbolic  properties,  the  former 
associated  with  the  diffusion  terms  and  the  latter  with  the  pressure  gradients  and  nonlinear 
terms.  Diffusion  of  momentum  will  lead  to  a  parabolic  partial  differential  equation.  The 
coupling  of  the  time  derivatives  to  the  pressure  gradients  will  lead  to  a  system  that  has 
hyperbolic  characteristics. 

The  numerical  schemes  for  solving  time-dependent  partial  differential  equations  fall 
generally  into  two  classes:  explicit  or  implicit.  The  term  “explicit”  denotes  a  scheme  where 
all  terms  on  the  right  hand  side  (r.h.s.)  of  system  (1)-(3)  are  evaluated  at  time  steps  n, 
n  - 1,  etc.  So,  at  any  time  the  r.h.s.is  known  from  previous  steps.  “Implicir,  on  the 
other  hand,  denotes  a  scheme  where  some  of  the  terms  on  the  r.h.s.  are  evaluated  at 
time  step  and,  thus,  we  must  transfer  these  terms  to  the  left  hand  side  (l.h.s.)  of 
the  equations  and  invert  the  corresponding  coefficient  matrix  of  the  unknown  variables  at 

Even  with  the  recent  advances  in  the  numerical  solution  of  partial  differential  equa¬ 
tions  using  the  implicit  treatment,  the  different  nature  of  new  computer  architectures  and 
the  increasing  size  of  computer  memories  are  leading  to  renewed  consideration  of  the 
explicit  treatment  of  the  barotropic  mode.  As  the  number  of  vertical  levels  or  layers  in¬ 
creases,  the  fraction  of  computer  time  spent  in  solving  the  2-D  barotropic  equations  per 
level  tends  to  decrease.  Also,  most  explicit  techniques  are  usually  fully  vectorizable  and 
“parallelizable”,  giving  strong  competition  to  the  implicit  techniques  on  computers  of  this 
and  the  next  decade  [5].  Therefore,  since  SWEM  uses  explicit  numerical  schemes  to 
solve  the  system  (1)-(3),  we  will  now  concentrate  on  explicit  time  integration. 

The  finite  difference  analog  of  equations  (1)-(3)  follows  from  the  finite  difference  ex¬ 
pressions  for  the  first  and  second  spatial  derivatives  in  the  X-  and  ^/-directions.  Denoting 
f{x,  y)  =  f{i  dx,  j  dy)  =  fij,  we  have 


dx 


—  Ui—ij)/2AX^ 


(4) 
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r 


dV 


dy 


fu_ 

dx^ 

d‘^V 

dy^ 


=  (Ui-\-\J  +  Ui-ij  2Uij)/AX  , 

=  (Vi,j+l  +  ^J-1  ”■  2Vi,i)/^y  • 


(5) 

(6) 
(7) 


In  order  to  obey  the  stability  conditions  of  the  finite  difference  scheme,  the  terms  of  equa¬ 
tions  4-5  are  computed  aXt  =  n,  and  6-7  are  computed  at  ^  =  n  —  1.  Similarly, 
the  pressure  and  Coriolis  terms  are  computed  aXt  =  n,  while  the  bottom  stress,  Ti,,  is 
computed  at  i  =  n  —  1  [7]. 

By  locating  the  variables  on  a  staggered  mesh  called  the  C-grid,  the  pressure  p  and 
height  h  variables  are  located  at  the  center  of  the  mesh  boxes,  and  the  mass  transports 
U  and  V  at  the  center  of  the  box  boundaries  facing  the  x  and  y  directions,  respectiveiy. 


The  Arakawa  C-grid  (shown  in  figure  1)  [7]  is  the  grid  modei  being  used.  The  thick 
outer  line  shows  the  position  of  the  model  boundary.  The  points  inside  the  boundary  are 
those  which  are  advanced  in  time  using  the  model  physics.  The  points  on  the  boundary 
and  those  on  the  outside  must  be  supplied  by  the  boundary  conditions. 

The  two-dimensional  model  fields  are  carried  in  three-dimensionai  arrays  where  the 
third  array  index  refers  to  the  two  time  levels  (old  and  new).  The  integers  i  and  j  are 
used  throughout  the  model  to  index  the  two  spatial  dimensions: 

•  i  Index  variable  for  the  ^  direction. 

•  j  index  variable  for  the  rj  direction. 

The  range  of  ^  is  1  to  L  and  the  range  of  7)  is  1  to  M  [7]. 
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Figure  1:  The  Arakawa  C-grid. 
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4  The  Parallelization  Process 


The  process  of  parallelizing  the  SWEM  code  consisted  of  many  steps.  First  was  determin¬ 
ing  the  dependencies  between  adjacent  array  cells  within  the  primary  arrays.  Next,  we 
had  to  determine  both  how  and  where  the  data  was  to  be  propagated  between  workers. 
We  then  had  to  derive  overlap  definitions  to  define  the  amount  of  data  to  be  shared  be¬ 
tween  workers.  Once  initial  coding  was  completed,  many  unforeseen  problems  resulted 
in  which  a  variety  of  debugging  methods  were  devised  to  aid  in  solving  them.  Lastly,  when 
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all  of  the  coding  was  completed,  we  tested  our  parallel  model  for  speedup. 

4.1  Parallelization  Strategy  for  SWEM 

The  SWEM  code  is  comprised  of  three  major  parts.  The  first  part  is  the  initialization 
phase  where  all  of  the  primary  variables  are  given  starting  values.  Part  two  consists  of 
three  nested  loops  which  do  the  actual  calculations.  The  final  part  is  the  output  phase 
where  all  necessary  data  is  recorded. 

Our  first  objective  was  to  determine  which  of  these  parts  lent  themselves  to  be  paral¬ 
lelized.  Since  the  first  and  last  parts  made  up  only  a  small  percentage  of  the  computation 
time,  they  were  left  essentially  unchanged.  The  bulk  of  the  code’s  work  was  being  done 
by  the  three  loops,  and  it  was  this  portion  of  the  code  that  was  to  be  parallelized. 

The  primary  data  consists  of  three  3-dimensional  arrays  -  ubar ,  vbar ,  and  zeta 
-  along  with  a  multitude  of  scalars  and  2-dimensional  arrays  used  for  intermediate  calcu¬ 
lations.  The  3-dimensional  arrays  were  of  size  RowJength  *  ColumnJength  *  2,  with  the 
2  in  the  third  dimension  representing  the  old  and  the  new  values  for  the  two  dimensional 
components.  For  example,  ubar(i,j,kold)  represents  the  old  value  for  the  two  dimen¬ 
sional  ubar(i,j),  while  ubar(i,j,knew)  represents  the  new  value.  As  computations  with 
these  arrays  proceed,  the  data  for  knew  is  usually  overwritten  with  the  data  from  the 
previous  time  step  and  the  values  of  kold  and  knew  are  switched. 

The  parallel  model  we  incorporated  into  SWEM  was  the  master-worker  scheme, 
where  each  worker  would  do  exactly  the  same  calculations,  but  on  a  different  portion 
of  the  data  [4].  This  scheme  not  only  seemed  to  be  best  suited  for  SWEM’s  structure  of 
calculation,  but  also  was  well  suited  for  SWEM’s  structure  of  data.  The  data  -  primarily 
two  and  three  dimensional  arrays  -  could  be  divided  among  all  of  the  workers,  and  it  was 
this  division  that  determined  how  much  calculation  time  would  be  spent  by  each  worker. 
Each  worker  would  basically  receive  its  data  from  the  master  program,  execute  the  three 
nested  loops,  and  return  its  results  back  to  the  master  program. 


4.2  Determining  Array  Dependencies 

Many  of  SWEM’s  calculations  consist  of  averages  or  differences  among  array  cells  which 
are  nearest  neighbors.  Some  even  depended  on  next  nearest  neighbors.  From  this,  it 
was  evident  that  each  worker  would  have  to  share  data  with  an  adjoining  worker.  There¬ 
fore,  we  had  to  determine  how  much  data  was  to  be  shared  between  the  workers  to  ensure 
correct  calculations.  One  possible  method  of  doing  this  would  be  to  painstakingly  read 
every  line  of  code  to  determine  which  arrays  depended  on  which  other  arrays.  Seeing 
that  there  were  on  the  order  of  fifty  arrays  to  check,  doing  this  by  hand  was  out  of  the 

question. 

To  help  speed  things  along,  we  wrote  a  simple  array  dependency  program  which 
scanned  a  program  and  determined  each  array’s  dependency  on  other  arrays.  Before 
using  this  array  dependency  program,  we  first  had  to  insert  into  each  subroutine  blocks  of 
commented  lines  consisting  of  an  array  name,  its  indices,  and  any  arrays  that  it  depended 
on  in  a  single  calculation.  Using  this  program  we  rapidly  determined  that  dependencies 
spanning  two  rows  and  columns  existed  for  the  three  main  arrays  based  on  inputs  and 
outputs  of  complete  subroutines.  For  example,  in  this  calculation 

tmp2(i,j)  =  inDon_r(i,j  ,krhs)  * 

(ubar(i+l, j .krhs)  -  ubar(i, j »krhs)) 

the  array  tmp2  depends  on  two  arrays  -  mDoil_r  and  ubar.  However,  calcu¬ 
lation  of  an  array  cell  of  tmp2  depends  on  mDon_r  and  two  elements  of  ubar. 
This  represents  a  neighbor  dependency  along  each  row  of  ubar .  Also,  we  do  not  know 
exactly  which  arrays  these  two  depend  on.  To  determine  this,  we  would  place  these  com¬ 
ments  below  this  statement, 
c 

cdep  tmp2(i ,j )  ~ mDomr (0,0,0)  ubarCO :  1 ,0,0) 
c 

The  ’dep’  tells  the  array  dependency  program  that  this  comment  line  is  to  be  pro¬ 
cessed.  Next  is  the  array  to  be  checked  along  with  its  dimensions.  The  represents 


’depends  on’  and  was  used  to  help  the  programmer  distinguish  the  line  from  the  original 
assignment  statement.  Following  the  ’~’  symbol  are  the  arrays  that  trn.p2  depends 
on  in  the  calculation.  Inside  the  parentheses,  the  numbers  represent  the  neighbor  de¬ 
pendencies  within  an  array.  Essentially,  a  descriptor  of  the  form 

-  number_of_left_neighbors  :  number_of _right_neighbors 

is  used  for  each  dimension,  unless  there  are  no  neighbor  dependencies  at  all,  in  which 
case  a  ’0’  is  used.  For  example,  (0,0,0)  represents  no  neighbor  dependencies  in  three 
dimensions.  (0:1 ,0,0)  represents  a  right  neighbor  dependency  in  each  row,  and  no  neigh¬ 
bor  dependencies  in  the  other  two  dimensions.  (0,-1 :1,0)  represents  a  top  and  bottom 
dependency  in  each  column,  and  no  neighbor  dependencies  in  the  other  two  dimensions. 
When  the  dependency  program  is  run  on  our  example,  the  following  output  is  produced: 
c 

cdep  tmp2 (i ,  j ) inDon_r (0 , 0 , 0)  ubar(0 : 1 ,0,0) 
c  tmp2(i, j)  ~ 
c  mDon_r(i, j ,k) 

c  ubar(i : i+1 , j  ,k) 

c 

4.3  Data  Propagation  in  SWEM 

After  examining  all  of  the  subroutines  called  within  the  three  nested  loops  with  the  array 
dependency  program,  we  determined  that  the  largest  neighbor  dependency  was  a  two 
neighbor  dependency  in  both  the  first  and  second  dimensions.  We  then  had  to  decide 
which  method  of  dividing  the  data  we  would  use,  how  we  would  propagate  data  from  the 
master  to  the  workers,  and  how  we  would  propagate  data  between  workers. 

4.3.1  Dividing  the  data 

We  essentially  had  three  choices  of  how  we  could  divide  the  data,  as  shown  in  Figure 
2:  1)  divide  the  data  by  rows,  2)  divide  the  data  by  columns,  or  3)  divide  the  data  by 
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Division  by  rows  Division  by  columns  Division  by  rows  and  columns 

Figure  2:  Three  choices  for  dividing  the  data. 

both  rows  and  columns.  Dividing  the  data  by  rows  and  dividing  the  data  by  columns  are 
essentially  the  same.  The  only  difference  is  the  way  you  visualize  the  data. 

However,  since  the  program  was  written  in  FORTRAN,  which  uses  column  major  ar¬ 
ray  indexing,  we  decided  that  division  by  rows  would  make  for  more  efficient  data  trans¬ 
mission  between  workers.  It  would  also  make  for  easier  coding  of  the  data  transmission 
subroutines  to  be  written  using  Glenda.  With  division  by  columns,  the  data  being  sent 
would  not  be  contiguous  and  would  require  multiple  sends  to  transport  the  data  from  one 
process  to  another.  On  the  other  hand,  with  division  by  rows,  the  data  could  be  sent  in 
contiguous  blocks  which  would  require  only  one  sending  call. 

We  also  considered  dividing  the  data  by  both  rows  and  columns,  but  this  too  had  its 
problems.  First  of  all,  more  divisions  meant  that  more  sending  calls  would  be  needed  to 
send  all  of  the  data.  This  would  increase  communication  times  and  decrease  efficiency. 
But  most  importantly,  even  though  division  by  both  rows  and  columns  gives  us  less  total 
data  to  have  to  work  with,  the  difference  between  this  method  and  division  by  rows  can 
be  negligible.  To  illustrate  this,  consider  an  n  X  n  array  of  double  precision,  floating 
point  numbers  divided  between  workers.  With  division  by  rows  only,  we  have  for  the 
amount  of  data  in  bytes 

Amtjof  JData  =  n  *  [n  +  4  *  —  i)]  *  8  (8) 


14 


Number  of  Workers 

4 

16 

64 

Division  by  Rows 

89600 

128000 

281600 

Division  by  Rows 

and  Columns 

86528 

100352 

131072 

Table  1:  Amount  of  data  (in  bytes)  being  processed  per  worker. 

With  division  by  rows  and  columns,  we  have 

Amt.of  Data  =  [n  +  4^  (p  —  ^)f  *8  (9) 

Suppose  the  array  we  are  working  with  is  100  X  100,  and  we  divide  it  between  four 
workers.  With  division  by  rows  only  we  have  89600  bytes.  With  division  by  rows  and 
columns  we  have  86528  bytes.  Thus,  there  is  only  about  a  three  percent  difference  in  the 
amount  of  data  being  worked  on  by  all  of  the  workers.  Table  1  shows  how  the  amount 
of  data  being  processed  (in  bytes)  varies  as  we  increase  the  level  of  parallelization.  As 
seen  in  the  table,  as  the  number  of  CPUs  increases,  the  difference  between  division  by 
rows  and  division  by  rows  and  columns  increases  significantly.  Therefore,  if  we  run  the 
simulation  with  more  processors,  it  would  be  advisable  to  divide  the  data  by  both  rows 
and  columns  for  increased  efficiency. 

4.3.2  Propagation  from  master  to  worker 

Each  worker  is  responsible  for  computing  results  for  a  range  of  rows  of  the  major  arrays. 
After  reading  the  input  data,  the  master  program  uses  the  gl_outt  o  operation  to  send 
the  appropriate  sections  of  these  arrays  to  each  worker  [11].  At  this  point  the  master  and 
the  workers  simultaneously  complete  the  initialization  process  by  computing  scalars  and 
arrays  derived  directly  from  the  input  data. 

Once  we  decided  on  dividing  the  data  by  rows,  we  could  now  begin  to  write  the  code 
for  the  propagation  of  data  between  processes.  Transfer  of  data  between  the  master 
process  and  its  worker  processes  was  trivial.  We  simply  divided  the  data  into  blocks  and 


sent  each  block  to  the  appropriate  worker  using  a  subroutine  written  in  C  using  Glenda 
to  transfer  the  data  between  processes.  Each  worker  will  be  responsible  for  a  particular 
block  of  data  In  each  array.  To  send  the  data  arrays  from  the  master  to  the  workers,  the 
call 

gl_outto (worker_tids [i] , "3D_arrays" , 

ubar  +  ubar_offset  :  ubar_size, 

vbar  +  vbar_offset  :  vbar_size, 

zeta  +  zeta_offset  :  zeta.size); 

was  used.  The  array  “worker_tids[z]”  contains  the  task  id  number  for  the  2’th  worker  pro¬ 
cess.  The  string  “3D.arrays”  is  the  name  of  the  data  tuple  being  sent.  Each  array  being 
sent  is  represented  by 

•  array  name  +  array  offset  for  worker  i :  block  size  for  worker  i 

To  receive  the  data  arrays  that  were  sent  from  the  master  to  the  workers,  the  call 

gl_into ("3D_arrays" , 

?  ubar  :len,?  vbar  :len,?  zeta  :len); 

will  be  used.  Again,  the  string  “3D-arrays”  is  the  name  of  the  data  tuple  being  received. 
The  “?”  tells  Glenda  to  place  the  corresponding  data  item  in  the  matching  tuple  into  the 
variable  following  the  “?”.  If  a  data  item  being  sent  to  gl_illto  is  an  array,  then  a  variable 
is  used  to  receive  the  size  of  the  array  being  sent.  Here,  the  sizes  of  the  arrays  was  not 
needed,  so  the  sizes  were  received  in  a  dummy  variable  1611.  If  some  of  the  data  being 
sent  were  scalars,  then  the  size  and  offset  would  not  be  needed  in  the  gl_Outto  call, 
thus  removing  the  need  for  a  size  variable  in  the  gl_into  call. 

4.3.3  Propagation  between  workers 

The  SWEM  workers  were  each  assigned  a  range  of  rows  to  compute  new  values  within 
the  inner  ioops.  In  an  example  with  97  rows  and  three  workers  we  divided  the  data  as 
illustrated  in  Figure  3. 
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Overlap  data 
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Overlap  data 
Overlap  data 


Figure  3:  Division  of  data  with  97  rows  among  three  workers. 


#  Worker  1  computes  new  data  for  rows  1  through  35,  worker  2  computes  new  data 

for  rows  36  through  64,  while  worker  3  computes  new  data  for  rows  65  through  97.  The 
calculations  for  worker  1  require  data  for  2  additional  rows:  36  and  37.  Also  the  calcula¬ 
tions  for  worker  2  require  data  for  rows  34  and  35.  After  computing  new  data  for  rows  36 

•  and  37,  worker  2  needs  to  send  this  data  to  worker  1  to  prepare  it  for  the  next  iteration. 

At  the  same  time  worker  1  needs  to  send  its  new  values  for  rows  34  and  35  to  worker  2 
to  prepare  it  for  the  next  iteration.  A  similar  process  occurs  between  workers  2  and  3. 


^  4.3.4  Overlap  definitions 

During  the  process  of  writing  the  code  for  the  propagation  routines,  it  soon  became  evident 
that  there  was  a  need  for  some  easy  way  to  allow  myself  to  write  “readable”  code.  Each 
time  there  was  a  call  to  gLoutto  or  gLinto,  the  size  of  the  overlap  regions  were 
•  needed  for  each  array.  Also  needed  was  the  offset  required  to  get  to  the  last  two  rows  of 

data  in  each  array  (the  overlap  with  the  next  worker).  And  finally,  in  order  to  know  how 
many  rows  of  overlap  to  send  to  an  adjacent  worker,  you  also  needed  to  know  whether 


17 


the  array’s  row  indices  began  with  1  or  0. 

To  solve  this  problem,  we  developed  fixed  equations  that  defined  exactly  which  data 
region  was  to  be  sent  to  adjoining  workers.  These  equations  would  define  the  offset  and 
size  for  each  boundary  region  to  be  propagated.  They  are  as  follows: 


worker_0  =  (worker  ==  0)  ?  2  :  0 
if  worker  <  (num_workers  -  1) 

offset  =(nuin_rows  -  worker_0) 
size  =( (array .start  ==  2)  ? 
if  worker  >  0 

offset  =(3  -  array.start)  *  array .width 
size  =(2  -  array.adj)  *  array.width 


*  array.width 
1 : 2) ^array.width 


Here,  worker _0  is  defined  to  be  equal  to  2  if  the  worker  in  question  is  the  worker 
designated  as  the  first  worker  (worker  0).  Otherwise,  its  equal  to  0.  This  variable  is  used 
to  determine  how  many  rows  from  the  end  of  the  array  the  overlap  block  begins.  If  the 
worker  in  question  is  not  the  last  worker,  then  it  must  send  its  last  two  (or  one)  rows  that  it 
is  responsible  for  calculating  to  the  next  worker.  In  this  case,  offset  and  size  are  as 
follows:  offset  defines  where  in  the  array  the  overlap  block  begins.  num_rows 
defines  how  many  rows  of  the  array  the  worker  is  working  with,  array -Width  de¬ 
fines  the  number  of  columns  in  the  array,  size  defines  how  many  array  elements  are  in 
the  overlap  block,  array  .start  is  the  starting  row  index  for  the  array.  If  it  is  equal  to 
2,  the  size  of  the  overlap  block  is  only  1  *  array.width.  If  It  Is  equal  to  0  or  1 , 
the  overlap  block  is  equal  to  2  *  array.width. 

If  the  worker  in  question  is  not  the  first  worker,  it  must  send  its  first  two  rows  that 
it  is  responsible  for  calculating  to  the  previous  worker.  The  offset  skips  past  the  first 
two  (or  three)  rows  of  overlap  data  to  get  to  the  correct  block.  The  size  is  determined 
by  array.adj  and,  of  course,  the  array.width.  If  the  array  in  question  has  a 
beginning  row  index  of  2,  it  has  an  array  adjustment  -  array  .adj  -  of  1 .  If  it  has  a 
beginning  row  index  of  0  or  1 ,  then  array.  adj  is  equal  to  0.  Thus,  the  size  will  be 
equal  to  2  *  array.width,  unless  the  array  has  a  beginning  row  index  of  2. 


Here  is  the  code  required  to  send  the  propagated  data  from  one  worker  to  another. 


worker_0  =  (^worker  ==  0)  ? 
if(  ^worker  <  (*num_workers 
{ 

zeta_offset  =(*num_rows  - 
zeta_size  =( (zeta_start 
iibar_offset  =(*iium_rows  - 
ubar_size  =( (ubar_start 
vbar_offset  =(*num_rows  - 
vbar_size  =( (vbar.start 


2  :  0; 

-  1)  ) 

worker_0)*  zeta_width; 

==  2)  ?  1  :  2)*  zeta_width; 
worker_0)  *  ubar_width; 

==  2)  ?  1  :  2)*  ubar .width; 
worker.O)  *  vbar.width; 

==  2)  ?  1  :  2)*  vbar .width 


gl.outto (worker.tids [*worker+l] , "bound.bottom" , 
zeta+zeta.of f set  :  zeta.size, 
ubar+ubar.of f set  :  ubar.size, 
vbar+vbar.of f set  :  vbar.size) ; 


if(  ^worker  >  0  ) 

{ 


zeta.offset  =(3  - 
zeta.size  =(2  - 
ubar.offset  =(3  - 
ubar.size  =(2  - 
vbar.offset  =(3  - 
vbar.size  =(2  - 


zeta.start)*  zeta.width; 
zeta.adj)*  zeta.width; 
ubar.start)*  ubar .width; 
ubar.adj)*  ubar.width; 
vbar.start)*  vbar .width; 
vbar.adj)*  vbar.width; 


gl.outto (worker.tids [^worker  -  1] , "bound.top" , 
zeta+zeta.of f set  :  zeta.size, 
ubar+ubar.of f set  :  ubar.size, 


vbar+vbar_of f set  :  vbar^size) ; 


} 


Notice  that  for  some  workers,  both  if  conditions  wiil  be  true.  These  workers  are  respon¬ 
sible  for  interior  blocks  which  have  overlapping  regions  at  both  the  beginning  of  the  array 
and  the  end  of  the  array. 

In  order  to  receive  these  blocks  of  data,  each  worker  must  also  have  similar  defini¬ 
tions  for  the  offsets  so  that  the  blocks  of  data  are  placed  in  the  correct  locations  within 
the  array.  They  are  as  follows: 

worker_0  =  (worker  ==  0)?  num.rows  +  1  :iiiim_rows  +  3; 
if  worker  <  (num_workers  -  1) 

offset  =(worker_0  -  array.start)*  array.width 
if  worker  >  0 

offset  =  0 

Here,  worker  _0  defines  how  many  rows  to  skip  in  order  to  locate  the  correct  location 
within  an  array  to  place  the  received  block.  If  the  array  in  question  has  its  row  index  begin 
with  array -Start  and  the  worker  in  question  is  not  the  last  worker,  then  that  value 
is  subtracted  from  worker_0  to  get  the  correct  offset  for  that  particular  array.  If  the 
worker  in  question  is  not  the  first  worker,  then  the  offset  is  0  for  all  arrays.  An  example  of 
the  code  designed  to  receive  the  blocks  is  as  follows: 

worker.O  =(*worker  ==  0)?  *niim_rows  +  1  :*niiin_rows  +  3 

if(*worker  <  (*nuiii_workers  -  1)  ) 

{ 

zeta_offset  =  (worker_0  -  zeta_start)  *  zeta_width; 

ubar.offset  =  (worker_0  -  ubar_start)  *  ubar .width; 

vbar.offset  =  (worker.O  -  vbar.start)  *  vbar .width 

gl.intoC'bound.top" , 

?  zeta+zeta.off set  :  zeta.size. 


1 


34 

35 

36 

37 


Woricer#! 


Figure  4:  Propagation  of  data  between  workers. 


?  ubar+ubar_of f set  :  iibar_size, 

?  vbar+vbar_off set  :  vbar_size) ; 

> 

if(*worker  >  0) 

{ 

zeta_offset  =  0; 
ubar_offset  =  0; 
vbar_offset  =  0; 
gl_iiito  (  "bound_bottom" , 

?  zeta+zeta_off set  :  zeta_size, 

?  ubar+ubar_off set  :  iibar_size, 

?  vbar+vbar_off set  :  vbar_size) ; 

} 

Again,  if  a  worker  is  not  the  first  or  last  worker,  both  if  conditions  hold.  Therefore,  each  of 
these  workers  will  require  two  receives.  The  procedure  for  propagation  of  data  between 
workers  is  demonstrated  in  Figure  4. 


4.4  Problems  Encountered 


As  already  seen,  many  problems  were  encountered  during  the  process  of  attempting  to 
parallelize  the  SWEM  code.  There  were  many  obstacles  standing  in  our  way  at  the  very 
start.  Some  of  the  problems  we  encountered  were  simple  problems  that  were  rather  easy 
to  solve,  while  others  ended  up  costing  us  valuable  time  and  effort. 

One  major  problem  was  with  the  arrays  themselves.  In  the  initial  stages  of  our  re¬ 
search,  our  knowledge  of  both  ocean  dynamics  and  the  mathematical  basis  behind  the 
model  was  rather  limited.  Many  of  the  array  names  gave  little  indication  as  to  their  pur¬ 
pose,  Names  such  as  omn_u,  mDon_r,  and  on_u  told  us  very  little  about  their 
meaning.  Also,  many  arrays  had  names  that  differed  in  only  one  or  two  characters.  Arrays 
such  as  nDom_x  and  mDonjr,  ubar  and  vbar,  and  dndx  and  dmde 
could  cause  the  program  to  produce  incorrect  data  if  they  happened  to  accidentally  be 
exchanged  for  one  another.  These  types  of  errors  are  extremely  difficult  to  find,  and  can 
cause  many  long  delays  in  the  coding  process. 

Another  problem  we  had  with  the  arrays  was  the  way  they  were  defined.  Each  array 
was  given  a  starting  index  and  an  ending  index  for  each  dimension.  Unfortunately,  these 
starting  and  ending  indices  were  usually  different  from  one  array  to  another.  These  differ¬ 
ences  caused  us  to  lose  hours  of  valuable  time  that  could  have  been  spent  programming 
rather  than  debugging.  If  an  array  size  or  an  array’s  starting  index  was  mistakingly  given 
the  wrong  value,  we  could  end  up  placing  values  past  the  end  of  an  array  and  writing  over 
half  of  our  data  in  memory  in  the  process.  Once  this  happens,  a  close  inspection  of  all  of 
the  code  is  usually  required  to  find  out  where  it  went  wrong. 

Yet  another  problem  we  had  to  deal  with  was  how  to  handle  masking  arrays  used 
in  the  original  SWEM  code  to  mask  out  any  land  areas  that  were  in  the  particular  grid 
being  modeled.^  The  SWEM  masking  arrays  consist  of  single-dimension  arrays  of  indices 
which  define  the  locations  of  land  areas  within  the  study  grid.  These  arrays  are  used 
within  SWEM  to  reference  the  land  grid  points  while  treating  the  various  two-dimensional 
data  arrays  as  single-dimension  arrays  beginning  with  index  1 .  This  makes  the  masking 


^This  was  the  only  part  of  the  original  code  that  had  to  be  adjusted. 


arrays  consist  of  index  values  which  require  careful  interpretation  to  determine  which 
indices  refer  to  the  sub-grid  processed  by  a  particular  worker. 

We  wrote  a  subroutine  that  modifies  any  masking  arrays  being  used,  and  limits  the 
masks  to  only  those  which  refer  to  data  contained  within  a  particular  worker.  It  also  sub¬ 
tracts  an  offset  from  its  indices  to  make  them  correct  for  the  worker’s  sub-arrays.  The 
amount  of  offset  within  the  worker  is  dependent  upon  whether  this  worker  is  the  first 
worker,  the  widths  of  the  arrays,  and  whether  the  arrays  being  adjusted  start  at  row  0, 1 , 
or  2  (this  problem  was  mentioned  earlier). 

4.5  Debugging  Methods  Used 

With  all  of  the  problems  we  encountered,  there  was  a  need  for  an  easy  way  to  know 
whether  or  not  the  data  being  calculated  by  the  parallel  model  was  correct.  One  way  to 
do  this  was  to  compare  the  final  output  files  of  the  parallel  version  of  SWEM  with  those 
of  the  sequential  version.  However,  if  there  were  any  differences  in  output,  there  was 
no  way  to  know  where  the  error  occured.  Therefore,  there  was  a  need  to  be  able  to 
locate  errors  at  points  within  the  code.  To  accomplish  this  task,  we  decided  to  run  the 
sequential  version  of  SWEM  and  the  parallel  version  of  SWEM  concurrently  within  the 
same  program.  The  part  of  the  program  being  parallelized  was  left  in  the  main  program 
and  was  executed  in  step  with  the  worker  processes  that  were  doing  the  parallel  part.  At 
strategic  points  within  the  master  process  and  the  worker  processes,  we  would  suspend 
calculations  and  send  data  from  the  parallel  part  to  the  sequential  part  of  the  program  for 
comparison.  Using  generic  send  and  receive  routines,  we  would  send  over  one  array  at 
a  time  and  use  a  generic  compare  routine  to  compare  it  with  its  sequential  counterpart. 
If  they  were  the  same,  we  would  try  another  array  until  we  found  the  one  causing  the 
error.  Of  course,  before  we  could  use  these  routines,  they  had  to  be  properly  debugged 
themselves.  The  problems  of  using  the  correct  array  sizes,  and  using  the  proper  starting 
index  caused  us  a  little  grief  in  the  initial  stages  of  debugging  the  code.  However,  once 
the  debugging  subroutines  were  debugged  themselves,  they  saved  us  days  (if  not  weeks) 
of  debugging  time. 


5  Testing  the  Parallel  Version  of  SWEM 


For  our  test,  we  used  a  three  CPU  SGI  340-GTXB,  operated  by  Mississippi  State  Univer¬ 
sity  Center  for  Air  Sea  Technology  (CAST).  We  were  given  sole  use  of  the  computer  for 
testing  purposes.  As  part  of  the  test,  we  ran  the  original  sequential  version  of  SWEM  to 
serve  as  a  basis  for  comparison.  Using  this  time,  we  will  be  able  to  determine  our  parallel 
version’s  speedup.  For  a  more  challenging  test,  we  also  compiled  the  original  SWEM 
code  using  the  SGI  MIPS  FORTRAN  77  “-pfa”  option  [8].  This  option  runs  the  pfa 
(POWER  Fortran  Accelerator^'^)  preprocessor  to  automatically  discover  parallelism  in 
the  source  code.  It  also  enables  the  multiprocessing  directives  for  the  SGI  340-GTXB. 
This  timing  gives  us  a  “yardstick”  upon  which  we  can  measure  our  parallel  version. 

The  timings  were  made  for  240  day  simulations. 


Our  results  are  shown  in  Table  2.  These  efficiency  values  are  not  particularly  high. 
The  primary  reason  for  this  is  that  the  array  sizes  are  relatively  small.  In  the  case  of 
the  SGI  compiler  it  requires  moderately  long  loops  to  achieve  high  efficiency  [8].  In  the 
Glenda  version  the  small  width  of  the  rows  means  that  there  is  a  relatively  small  amount 
of  time  spent  computing  between  communication  steps.  With  larger  data  sets,  we  predict 
that  both  the  SGI  Fortran  compiler  and  the  Glenda  code  would  achieve  higher  efficiencies. 
However,  it  is  unlikely  that  a  message-passing  solution  will  out-perform  the  tightly-coupled 
code  produced  by  the  compiler. 

In  a  similar  test,  we  compared  the  Glenda  version  with  a  version  of  SWEM  using  PVM 
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to  handle  the  communications.  As  expected,  the  PVM  version  was  slightly  more  efficient 
than  the  Glenda  version  -  but  only  by  about  two  percent.  This  test  was  based  on  a  48 
day  simulation  with  only  two  workers.  Due  to  hardware  limitations  and  time  constraints  at 
the  time  of  testing,  we  were  not  able  to  test  a  240  day  simulation  or  test  with  more  than 
two  workers.  However,  since  we  have  had  similar  comparative  results  using  Glenda  and 
PVM,  we  feel  that  this  test  was  an  accurate  comparison. 

6  Conclusion  and  Future  Research 

We  have  shown  that  Glenda  can  be  used  to  efficiently  parallelize  an  ocean  modeling 
program.  While  the  efficiency  is  lower  than  the  SGI  compiler’s,  the  Glenda  version  of 
SWEM  is  portable  to  a  variety  of  environments.  We  have  illustrated  how  we  divided  the 
work  among  the  workers  and  resolved  the  resulting  communication  problem. 

The  decision  to  divide  the  arrays  by  rows  resulted  in  an  efficient  solution  in  our  test 
environment.  Our  tests  were  done  with  a  moderately  small  data  set.  We  predict  that 
with  larger-sized  data  sets,  this  version  of  SWEM  will  be  efficient  up  to  perhaps  eight 
workers.  It  is  clear  that  division  by  rows  limits  the  scalability  and  at  some  point  it  would 
be  necessary  to  divide  the  data  by  rows  and  columns. 

Our  research  has  shown  that  Glenda  shows  promise  for  working  with  ocean  models. 
We  plan  to  extend  this  effort  by  testing  the  parallel  SWEM  in  new  environments.  In  par¬ 
ticular  we  are  interested  in  determining  how  efficiently  we  can  utilize  massively-parallel 
machines. 
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A  The  SWEM  Experimental  Region 


Our  experiments  analyzed  the  response  of  the  North  Atlantic  Ocean  to  the  climatological 
monthly  winds.  The  domain  is  configured  in  the  reduced  gravity  formulation  with  an  ini¬ 
tial  upper  layer  thickness  of  450  meters.  The  numerical  domain  has  closed  boundaries. 
The  boundary  fitted  coordinates  are  chosen  to  give  a  high  spatial  grid  resolution  (about 
50  kilometers)  in  the  western  subtropical  basin  in  order  to  analyze  the  circulation  of  the 
Caribbean  Sea  and  its  influence  on  the  dynamics  inside  the  Gulf  of  Mexico. 

Figure  5  illustrates  the  annual  mean  upper  layer  thickness  anomaly  (in  meters)  from 
the  numerical  experiments.  The  solution  indicates  a  double  gyre  system.  The  cyclonic 
(counterclockwise)  subarctic  gyre  is  unrealistic  due  to  the  close  boundary  configuration, 
but  the  western  intensification  of  the  anticyclonic  (clockwise)  gyre  (the  Gulf  Stream)  is 
well  reproduced.  The  current  separates  from  the  coast  in  the  proximity  of  Cape  Hatteras 
and  is  dominated  by  large  fluctuations  with  meanders  and  ring  formation,  as  supported 
by  observations  and  measurements. 

The  geometry  of  the  western  subtropical  basin  has  a  strong  effect  on  the  dynamics 
of  the  Caribbean  Sea  and  Gulf  of  Mexico.  Part  of  the  return  flow  of  the  wind  driven  gyre 
enters  the  Caribbean  Sea  from  several  openings  between  the  Antilles  Islands.  These 
branches  organize  into  a  narrow  current  (the  Yucatan  Current)  which  flows  along  the 
Mexican  Coast,  enters  the  Gulf  of  Mexico  through  the  Yucatan  Strait,  and  exits  from  the 
Florida  Strait.  There  it  rejoins  the  main  return  flow  of  the  anticyclonic  gyre  [1 0]. 
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Figure  5:  The  annual  mean  sea  surface  displacement  from  the  SWEM  simu¬ 
lations. 
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B  Parallel  Version  of  SWEM 


The  parallel  version  of  SWEM  containing  the  SWEM  FORTRAN  code,  as  weli  as  the  fiies 
containing  the  Glenda  subroutines  {master. suhs.  eg  and  worker. subs,  eg),  can 
be  downloaded  via  anonymous  ftp  at 

seahass.st.usm.edu 

The  two  files  containing  the  Glenda  subroutines  were  added  to  the  sequential  SWEM 
code  to  facilitate  the  parallelism. 

•  The  file  master. subs,  eg  contains  subroutines  needed  by  the  master  program 
to  communicate  with  the  worker  programs. 

•  The  file  worker.subs.  eg  contains  subroutines  needed  by  the  worker  programs 
to  communicate  with  the  master  program. 

AN  of  the  subroutines  were  written  in  C  and  used  Glenda  to  handle  the  interprocess  com¬ 
munication.  Within  the  SWEM  code,  these  subroutines  are  used  as  FORTRAN  subrou- 
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